Better Nonlinear Models from Noisy Data: Attractors with 

Maximum Likelihood 



A new approach to nonlinear modelling is presented which, by incorporating 
the global behaviour of the model, lifts shortcomings of both least squares and 
total least squares parameter estimates. Although ubiquitous in practice, a 
least squares approach is fundamentally flawed in that it assumes independent, 
normally distributed (IND) forecast errors: nonlinear models will not yield 
IND errors even if the noise is IND. A new cost function is obtained via the 
maximum likelihood principle; superior results are illustrated both for small 
data sets and infinitely long data streams. 
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A nonlinear model must be tuned via parameter estimation, ideally forcing it to mimic 
the observations. Typically, tuning aims for parameters which yield the least squared error 
]1|-|3[ (or total least squared error ||||) between the one-step forecasts and the data. After 
proving that even for the simplest nonlinear models, both least squares and total least squares 
systematically reject the correct parameter values (i.e. those that generated the data), a new 
and more robust method is derived which incorporates the global dynamics of the model 
(and hence its attractor). Failure to recognise the effects of imperfect observations will lead 
to biased parameter estimates, whereas an inability to reflect the data indicates model error. 
The present Letter focuses on the first issue: using the fact that one can always estimate 
the probability density function (PDF) of the model-state variable for different values of 
the unknown parameters, the maximum likelihood principle is employed to derive a new 
cost function which incorporates this information. This global approach has the potential 
to outperform all one-step (or few-step) methods, whether they are based on least squares 
criteria or some future improvement. Note that even with an infinite amount of data, the 
optimal least squares solution is simply incorrect; see Fig. [1] and the discussion below. 
The new cost function is shown to yield results consistent with the correct answer even for 
relatively small data sets and large noise levels in a variety of chaotic systems. It is applicable 
to high dimensional systems and may also be applied to nonlinear stochastic systems. 

Suppose the evolution of a system's state variable, x, e ]R m , is governed by the map 

x m = F(xi,a), (1) 

where the model's parameters are contained in the vector a 6 IPJ. For m — 1, the system 
state Xi is a scalar; assuming additive measurement noise rji yields observations Sj = Xi + rji. 
In a noise free setting (i.e. rji = V i), I + 1 sequential measurements Sj, Sj+i, Si + i would, 
in general, be sufficient to determine a. With noise, the PDF of both the measurement noise 
and the model-state variables are required to estimate a properly. 

Given the correct model structure -F(x, a) and data generated by particular parameters 
a (the 'true' parameter values), a cost function is often used to obtain an estimate of the 
unknown parameters ao- Ideally, this estimate converges to ao in the limit of an infinite 
number of observations. Complications of nonlinearity combined with the randomness of 
unavoidable measurement errors suggests a likelihood analysis for parameter estimation. 
Indeed, both least squares and total least squares are special cases of the likelihood method. 

The one-step least squares (LS) estimate, a L5 , is the value of a which minimises the least 
squares cost function 

N-l 

C LS (a) = £ El (2) 

i=l 

where Ei = s i+1 — F(si, a), the one-step prediction error. Figure [I] shows the failure of this 
well known technique when applied to the well known Logistic Map ||: a = 2, yet a LS < 2 
at all nonzero noise levels, even for an infinite data set. To see this, first recall this one- 
dimensional map F(x,a) = 1 — ax 2 . Writing the observed prediction error, s i+ i — F(si,a), 
explicitly in terms of the underlying system state and a realisation of the noise process 
yields Ei = r) i+ i — a x 2 + a(xj + rji) 2 . When the least squares cost function (0) is a minima 
J2iLi [EidEi/ da] = 0, and in the limit of an infinitely long data set this sum converges to an 
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integral over x taken with respect to the system's invariant measure, /i(x,a). For a = 2 in 
the Logistic map, fi(x, 2) = l/ny/l — x 2 for — 1 < x < 1 and zero otherwise. Thus (x 2 ) = 
1/2 and (a; 4 ) = 3/8. Since (x n ) = for odd n 2) is an even function), if the distribution 
of the noise process is also even, then 

_ ( 4(r? 2 ) + 3 \ 
aLS -{8(^ + 24^)+ 3) ^ (3) 

Equation @ yields the parameter estimate corresponding to an infinitely long data set as a 
function of noise level. For uniformly distributed noise [i.e. rj = U(—e, e)], (rf 1 ) = e n /(n + 1) 
for even n, while for normally distributed noise [i.e. r\ = iV(0, e 2 )], (rj 2 ) = e 2 and (rj 4 ) = 3e 4 , 
The noise level is defined as cr noise /a signa i, where o~ 2 oise and o~ 2 signal are the variances of the 
noise and the signal, respectively; for the uniform case o~ 2 oise = e 2 /3, whereas cr 2 oise = e 2 
for the normal case. Despite having a complete knowledge of the measurement process and 
data of infinite duration, the parameter estimates are biased. 

Let (xi,Xi + i) denote a successive pair of system states corresponding to observations 
(si, s i+1 ) where 

s i =x i + r ]i , Vi = N(0,e 2 ). (4) 

The system variables (xj, x i+ i) are sometimes called latent variables, due to the fact that they 
cannot be measured directly 0. They are fixed yet unknown, and therefore all probabilities 
ought to be conditional on the x«. In general, the probability of observing the pair (sj, 
depends on the model parameters a, the system variable Xi at time i, its image F(xi,a), 
and the measurement process. Specifically 

P( Si ,s i+1 |a,ar) = ^exp^-^J J (5) 

where 

d 2 (si, s i+1 , x, a) = (si - xf + (s i+1 - F(x, a)) 2 . (6) 

Assuming the and Sj + i are independent, the probability of observing a sequence of N — 1 
pairs, S = {(sj, s^)}^ 1 , corresponding with a particular set of model-states X = {xi}^' 1 
is given by the joint PDF: 



h=i 



N-l 



P(S|a,X) = J] P( Sj , Sm |a,^). (7) 



i=l 



Identifying the likelihood of parameters a generating the data S with the probability of 
observing data S given that the model has parameters a (see ||) yields 

L(a,X|S) =P(S|a,X), (8) 

where the conditional status of the model-state variables is explicit. Substituting (||) and 
(0) in (|) yields 
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L(a ' X|S) = (2^)^-i 6XP [~2* g dl ) ■ (9) 

L(a, X|S) depends on the PDF of the likely model-states and the parameters which maximise 
(|) will vary with the assumptions made regarding this distribution. These assumptions are 
paramount to this letter; ignoring information from the distribution will lead to total least 
squares (TLS), whereas requiring consistency between the data and the PDF of the model- 
state variables yields the new cost function below. Casella and Berger compare these 
assumptions for linear systems. 

Ignoring the PDF of the model-state variables, total least squares resolves the dependence 
on Xi by substituting any values Xi which maximise L(a|S), that is: 

1 / 1 N - 1 \ 

L(a|S) = (2^-1 eXp [~2? § Sgg^J ■ (10) 

The maximum of L(a, S) then corresponds to the minimum of the associated TLS cost 
function 

N-l 

C TLS (a) = £mm4. (11) 
i=i 

Thus, while the least squares cost function (|2|) minimises the squared vertical distances 
df = [sj+i — F(sj,a)] 2 , the TLS solution minimises the squared perpendicular distances © 
between the measured point (s$, s i+1 ) and a point on the hyper-surface (x,F(x, a)). No 
restrictions are placed on the values Xi, i — 1, N: the x-i are not a trajectory of F(x, a), nor 
do they reflect /i(x, a). Using the particular Xi which minimises each d\ reflects the decision 
to ignore any knowledge of the PDF of the model-state variables. But, given that the model 
is always in hand the PDF of the model, /i(x, a), is always obtainable (not that of the 
system, but of the model). This additional information may be incorporated by integrating 
the dependence on x^ out of the likelihood function in (^), yielding 

N-l . 

L(a|S) = Y[ / P(si,s i+1 \a,x)dii(x,a), (12) 
i=i Jx 

and associated maximum likelihood (ML) cost function: 

N ^ f ( d 2 \ 
C ML (a) = - Yl lo S J x ex P d ^ x ' a )' ( 13 ) 

Equation (|i~3|) is the main result of this Letter; it is superior to both Cls and Ctls- In 
practice, the integral in (|T3| ) is usually replaced by a sum over a model trajectory whose 
length, r >> N, is limited only by computational constraints. Thus 



N-l / T , J 

C ML (a) w - 1() g ( S ex P 1 - 

i=l \A;=1 1 ZC 



Si Xfc) 



+ (s l+1 -x k+1 ) 2 , (14) 
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where Xk = F k (xo,SL) and xq is any post transient value F(x,a) in the relevant basis of 
attraction. For fixed N this can be computed efficiently ||. As r — > oo, the particular value 
of Xq is irrelevant since the sum is dominated by those values of k for which Sj ~ Xk and 
s i+ i m the particular values of k being irrelevant. 

The Logistic map's invariant measure varies drastically for different values of a. The 
bifurcation diagram (see Fig. |2|a) illustrates this behaviour in the range 1.5 < a < 2. With 
ao = 2 and a noise level of 0.19, clls = 1-7 (see Fig. |I|); note from Fig. |2]a that this 
corresponds to a model with a period three orbit. Obviously, values of a corresponding 
with periodic windows are unlikely to be responsible for a data set with wildly aperiodic 
behaviour. Equation ([14]) allows this visually obvious result to be included implicitly into a 
cost function. 

The TLS cost function for the Logistic map may be obtained analytically by solving the 
cubic equation 

dd 2 

= 4a 2 x\ + [2 + 4(s m - l)a]xi - 2 Sl = 0, (15) 

and taking the root satisfying d 2 d 2 /dx 2 > 0. Results for three different cost functions are 
shown in Fig. 0. While the TLS cost function is much better than that of LS, the ML cost 
function is better still for all cases considered. Results are shown for data sets N = 100; for 
larger N the TLS solution improves, but in all cases tested the spread of the distribution 
remains smaller for the ML estimate. The simplicity of the Logistic map make it a weak 
test case and motivates further trials. 



The Moran-Ricker map [10] has a functional form F(x, a) = xexp[a(l — x)]; it's invariant 
measure (not shown) allows larger values of x with larger parameter values a. Figure |3| 
illustrates the results for each cost functions where ao = 3.7 and N = 100. The ML cost 
function consistently yields the best estimates for all noise levels considered. While it is 
common to claim an algorithm generalises to higher dimensional cases, this algorithm is 
easily generalised to m > 1 by substituting the term [||sj — H(x fc )||] for the term [(sj — x^) 2 } 
in equation (]T4|). Here the function H projects the system state vector x into the space 



of observations s. In delay coordinates, this corresponds to taking the 'last' component 



of Xfc. Fig. H shows results for the two-dimensional Henon map []nj] in delay coordinates, 
F(xi,Xi-i) = 1 — ax 2 + bxi-i, where ao = 1.4 and bo = 0.3. While the ML cost function 
surface is more highly structured due to sensitivity to the parameters, its minima are in the 
relevant regions as opposed to the smooth but incorrect LS minimum. The LS cost function 
has a biased minima, while ML is consistent with (ao,&o)- Whether this consistency is 
worth the computation depends on the problem at hand. Certainly the fact that ensemble 
forecasts of chaotic models using the ML estimates will relax naturally to a distribution 



consistent with w(x, a) is of value [12|. Better estimates of a also allow improved long term 
deterministic forecasts. The fact that higher dimensional models may require much larger 
data sets is a problem of uniqueness under the observations and cannot be laid at the door 
of the cost function. In Figure 4, N = 500 and the noise level is 0.05. Equation (|14"|) also 
allows an estimate of the magnitude of dynamical noise in a stochastic system [|12|] when the 
shape of the distribution of the noise is correctly specified. 

Both least squares (0) and total least squares (JIT]), are inferior to the maximum likelihood 



cost function (|T3|). Including the information on the invariant measure of the model aids in 
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the pursuit of reliable parameter estimates for nonlinear models. The weakest link in the 
derivation of the ML cost function is the assumption that the and Sj + i are independent, 
while they may be linearly uncorrelated, they cannot be independent. Attempts to relax 
this assumption will be presented in later work; here we note that (|H|) may be viewed as 
the first member of the family of cost functions 

^ N ~ n r ( d (n)2 \ 
Cffi(a) = £ / exp [ ) a), (16) 



i=l 



where 



2 n 

dt ] =E[^-^'(^' a )] 2 - (17) 
i=o 



For n > C^I(sl) is a multi-step cost function |3|,[T3|1 moderating the assumption that the Sj 
are independent, the aim being to find parameter values which both have the correct PDF 
and shadow the observations [p~2], 13 . The aultimate n = N — 1 multi-step least squares 



approach, solving simultaneously for xq and a, may prove intractable even for iV = 500 (see 
p~5f| ) . Future work will also focus upon the choice of optimal model order in local polynomial 



prediction , and the interpretation of cost functions when the underlying model structure 
is unknown. 

This work was supported by EC grant ERBFMBICT950203, ONR grant N00014-99-1- 
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FIGURES 



FIG. 1. Least squares estimate of a as a function of noise level using the analytic result @ 
corresponding to an infinite data set. The underlying system is the Logistic map with oo = 2. 
Parameter a is systematically underestimated for both normally distributed noise (solid), and uni- 
formly distributed noise (dashed). Both deviate significantly from the correct value (dot-dashed). 

FIG. 2. Logistic map: (a) bifurcation diagram illustrating the variation in fi(x,a), (b) distribu- 
tion of estimates contrasting the LS,TLS, and ML cost functions from 1000 realisations where ao 
= 1.85, N = 100, with normally distributed noise. The shading reflects the 95% limits, the solid 
line the mean. 

FIG. 3. Moran- Bicker map: a comparison of LS,TLS, and ML cost functions for 200 realisations 
with ao = 3.7, N = 100 with normally distributed measurement errors. The shading is as in Fig. 

I 

FIG. 4. Value of cost function in parameter space for a 2D delay reconstruction of the Henon 
map for ao = 1.4, 6q = 0.3, N = 500, and a noise level of 0.05: (a) Cls and (b) Cml- 
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Fig. 2, LC7579 McSharry 
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Fig. 4, LC7579 McSharry 
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